Forensic Science International: Genetics — Latest Matching Preprints

1

Calculating likelihoods and likelihood ratios at SNPs-based mixtures. A reappraisal of the binomial inference, as applied to forensic identity tests

Pascali, V. L.

2021-02-08 genetics 10.1101/2021.02.08.430218 medRxiv

Top 0.1%

91.1%

Single nucleotide polymorphisms (SNPs) are useful forensic markers. When a SNP-based forensic protocol targets a body fluid stain, it returns elementary evidence regardless of the number of individuals that might have contributed to the stain deposition. Therefore, drawing inference from a mixed stain with SNPs is different from drawing it with multinomial polymorphisms. We here revisit this subject, with a view to contributing a fresher insight into it. First, we model conditional semi-continuous likelihoods in terms of matrices of genotype permutations vs number of contributors (NTZsc). Secondly, we redefine some algebraic formulas to approach the semi-continuous calculation. To address allelic dropouts, we introduce a peak height ratio index (h, i.e. the minor read divided by the major read at any NGS-based typing result) into the semi-continuous formulas, so that they act as an acceptable proxy of the split drop (Haned et al, 2012) model of calculation. Thirdly, we introduce a new, empirical method to deduce the expected quantitative ratio at which the contributors of a mixture have originally mixed and the observed ratio generated by each genotype combination at each locus. Compliance between observed and expected quantity ratios is measured in terms of (1-χ²) values at each state of a locus deconvolution. These probability values are multiplied, along with the h index, by the relevant population probabilities to weigh the overall plausibility of each combination according to the quantitative perspective. We compare calculation performances of our empirical procedure (NITZq) with those of the EUROFORMIX software ver. 3.0.3. NITZq generates LR values a few orders of magnitude lower than EUROFORMIX when true contributors are used as POIs, but much lower LR values when false contributors are used as POIs. NITZ calculation routines may be useful, especially in combination with mass genomics typing protocols.

2

Improving the Accuracy of Forensic Age Estimation Through Bias Reduction

Flores, M.; Pellegrini, M.

2026-06-03 bioinformatics 10.64898/2026.05.30.728628 medRxiv

Top 0.1%

89.5%

Chronological age estimation can provide supporting information in forensic casework when traditional identification methods are limited. DNA methylation, a stable epigenetic mark, has emerged as a promising tool for predicting chronological age from trace samples. However, many existing age estimation models rely on linear regression approaches, which often yield biased prediction errors across the age distribution (i.e. model residuals show a significant age dependence). In this study, we compared three approaches for age estimation modeling: multivariable linear regression, random forest regression and maximum likelihood estimation. While the first two approaches are well established, for the third one we constructed and validated a DNA methylation-based LOESS regression maximum likelihood model for age estimation utilizing forensically relevant CpG markers. In all cases, model performance was evaluated through Leave-One-Out Cross-Validation (LOOCV). We utilized three independent publicly accessible methylation datasets collected using droplet digital PCR (ddPCR) to evaluate the most effective method for accuracy and bias in age estimation. Notably, the maximum likelihood approach showed less bias in the age-associated residuals than multivariable linear regression and random forest regression. These findings highlight the utility of non-linear modeling techniques in reducing the biases of epigenetic age estimation for forensic applications.
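The evaluation idea in this abstract (LOOCV plus a check for age dependence in the residuals) can be sketched in a few lines. This is a minimal illustration only, not the authors' pipeline: it uses a single-predictor linear fit, the helper names `fit_line`, `loocv_residuals`, and `residual_age_slope` are hypothetical, and the data are toy values invented for the example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loocv_residuals(meth, age):
    """Hold out each sample in turn, fit age ~ methylation on the rest,
    and record (predicted age - true age) for the held-out sample."""
    residuals = []
    for i in range(len(age)):
        train_m = meth[:i] + meth[i + 1:]
        train_a = age[:i] + age[i + 1:]
        a, b = fit_line(train_m, train_a)
        residuals.append((a + b * meth[i]) - age[i])
    return residuals

def residual_age_slope(age, residuals):
    """Slope of residual vs. true age; a value far from 0 signals age-dependent bias."""
    return fit_line(age, residuals)[1]

# Hypothetical toy data: methylation saturates with age, so a straight line misfits.
age = [10, 20, 30, 40, 50, 60, 70, 80]
meth = [0.20, 0.36, 0.49, 0.59, 0.67, 0.73, 0.78, 0.82]
res = loocv_residuals(meth, age)
print(round(residual_age_slope(age, res), 3))
```

The same residual-slope check could be applied to any of the three model families compared in the study, which is what makes it a convenient common yardstick for bias.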

3

SNP assays for DVI: cost, time, and performance information for decision-makers

Gettings, K. B.; Tillmar, A.; Marshall, C.; Sturk-Andreaggi, K.

2024-05-11 molecular biology 10.1101/2024.05.10.593619 medRxiv

Top 0.1%

87.7%

In mass disaster events, forensic DNA laboratories may be called upon to quickly pivot their operations toward identifying bodies and reuniting remains with family members. Ideally, laboratories have considered this possibility in advance and have a plan in place. Compared with traditional short tandem repeat (STR) typing, single nucleotide polymorphisms (SNPs) may be better suited to these disaster victim identification (DVI) scenarios due to their small genomic target size, resulting in an improved success rate in degraded DNA samples. As the landscape of technology has shifted toward DNA sequencing, many forensic laboratories now have benchtop instruments available for massively parallel sequencing (MPS), facilitating this operational pivot from routine forensic STR casework to DVI SNP typing. Herein, we review the commercially available SNP sequencing assays amenable to DVI, we use data simulations to explore the potential for kinship prediction from SNP panels of varying size, and we give an example DVI scenario as context for presenting the matrix of considerations: kinship predictive potential, cost, and throughput of current SNP assay options. This information is intended to assist laboratories in choosing a SNP system for disaster preparedness.

Highlights:
- Single nucleotide polymorphisms (SNPs) are useful in disaster victim identification (DVI).
- SNP panels amenable to human identification and extended kinship are described.
- Simulations demonstrate the potential for kinship prediction from SNP panels of varying size.
- Kinship predictive potential, cost, and throughput are presented for an example DVI scenario.
- Information is intended to assist laboratories in choosing a SNP system for disaster preparedness.

4

Using all available evidence to solve kinship cases

Egeland, T.; Marsico, F.

2025-05-08 genetics 10.1101/2025.05.03.652046 medRxiv

Top 0.1%

85.6%

Kinship cases, ranging from standard paternity tests to complex disaster victim identifications, are typically evaluated using likelihood ratios (LR) based on forensic genetic markers. However, in some contexts, genetic information alone is not enough to reach conclusive results. This is common when establishing distant familial connections using large DNA-databases, or even in simple cases such as determining which individual is the parent and which is the child in a relationship pair. Although forensic practitioners frequently incorporate additional evidence (SE), such as age, biological sex, or phenotypic traits, in these cases, this integration typically occurs informally, without rigorous probability estimation, compromising procedural transparency and reliability. Here, we present a comprehensive methodological framework that formally synthesizes forensic DNA evidence (FDE) with SE through Markov chain models and customized transition matrices designed for various biological traits. This approach generates combined likelihood assessments expressed as LRs or posterior probabilities. Validation through simulated and real-world case studies demonstrates that systematic incorporation of SE improves resolution accuracy in kinship determinations. To facilitate adoption, we have implemented this methodology in mispitools, an open-source R package.

5

Assessment of DNA methylation from a single genomic region of ELOV2 is sufficient to predict chronological age

Zhu, B.; Li, D.; Han, G.; Yao, X.; Gu, H.; Liu, T.; Liu, L.; Dai, J.; Liu, I. Z.; Liang, Y.; Zheng, J.; Sun, Z.; Lin, H.; Wang, W.; Liu, N.; Yu, H.; Shi, M.; Shen, G.; Qu, L.

2024-12-15 genetics 10.1101/2024.12.10.627662 medRxiv

Top 0.1%

79.5%

Estimation of chronological age is particularly informative in forensic contexts. Assessment of DNA methylation status allows for the prediction of age, though the accuracy and ease of manipulation may vary across different models. In this study, we started with a carefully designed discovery cohort recruiting more elderly subjects than other age categories, to diminish the effect of epigenetic drift. We analyzed DNA methylation from a single genomic region of ELOV2, which was sufficient to construct an age-prediction model comprising 15 CpG sites. This model was further validated in an independent cohort as well as in a multi-center test using trace dried bloodstains. Because our analytical pipeline combines the assessment of a single genomic locus with high-throughput sequencing, it can easily be scaled up at low cost. Taken together, we propose a new age-prediction model featuring accuracy, ease of manipulation, high throughput, and low cost. This model can be readily applied in both classic and newly emergent forensic contexts that require age estimation.

6

Likelihood Ratios for physical traits in forensic investigations

Marsico, F.; Egeland, T.

2024-05-30 genetics 10.1101/2024.05.25.595720 medRxiv

Top 0.1%

76.1%

Recent years have seen significant advances in DNA phenotyping, which predicts the physical traits of an unknown person, such as hair, eyes, and skin color, using DNA data. This technique is increasingly used in forensic investigations to identify missing persons, disaster victims, and suspects of crimes. A key contribution of DNA phenotyping is that it allows researchers to search through lists of individuals with similar characteristics, often gathered from testimonies, photographs, and social media data. However, despite their growing relevance, current methods lack comprehensive mathematical models to calculate likelihood ratios that accurately assess the statistical weight of evidence. Our work bridges this gap by developing new likelihood ratio models, validated through computational simulations. In addition, we demonstrate the ability of these models to improve forensic investigations in real-world scenarios. Furthermore, we introduce the R package forensicolors, freely available on CRAN, to facilitate the application of the methodologies developed.

7

Likelihood Ratios Given Activity-Level Propositions for DNA Transfer Evidence: Practical Implementation and Simulation Studies Using the HaloGen Engine (Part II)

Gill, P.; Bleka, O.

2026-02-09 genetics 10.64898/2026.02.06.703509 medRxiv

Top 0.1%

74.1%

The interpretation of findings of low-template DNA given activity-level propositions requires robust statistical models capable of accommodating substantial inter-laboratory and case-specific variability. This paper presents the practical implementation of HaloGen, an open-source hierarchical Bayesian framework for calculating activity-level likelihood ratios (LRs) from DNA quantity data. We compare three modelling approaches derived from the framework: a Group model, which combines data across laboratories, a hierarchically informed Lab-Bayes model, and a standalone, laboratory-specific Lab-Vague model. Through a series of simulation studies, we demonstrate that evidential strength is highly sensitive not only to DNA quantity but also to case context, particularly the assumed number of offenders (NS). We further show that inter-laboratory differences in DNA recovery and dropout can lead to materially different LRs, making unvalidated use of pooled or external data potentially misleading. To address practical implementation, we propose a minimum-effort validation pathway for laboratories wanting to report findings given activity-level propositions. Our results indicate that a small number of direct/secondary transfer experiments (n ≈ 6-12) are sufficient to obtain conservative LRs compared with a generic population model. Finally, these results clarify how contextual assumptions enter mathematically into activity-level inference, demonstrating that confirmation bias can arise naturally from unexamined modelling choices and underscoring the importance of transparent, explicit specification of propositions and parameters.

8

Secondary DNA transfer on denim using a human blood analogue

Ridings, R.; Gabriel, A.; Elliott, C. I.; Shafer, A.

2021-11-25 genetics 10.1101/2021.11.25.470033 medRxiv

Top 0.1%

69.7%

DNA quantification technology has increased in accuracy and sensitivity, now allowing for detection and profiling of trace DNA. Secondary DNA transfer occurs when DNA is deposited via an intermediary source (e.g. clothing, tools, utensils). Multiple courtrooms have now seen secondary transfer introduced as an explanation for DNA being present at a crime scene, but sparse experimental studies mean expert opinions are often limited. Here, we used bovine blood and indigo denim substrates to quantify the amount of secondary DNA transfer and quality of STRs under three different physical contact scenarios: passive, pressure, and friction. We showed that the DNA transfer was highest under a friction scenario, followed by pressure and passive treatments. The STR profiles showed a similar, albeit less pronounced trend, with correctly scored alleles and genotype completeness being highest under a friction scenario, followed by pressure and passive. DNA on the primary substrate showed a decrease in concentration and genotype completeness both immediately and at 24 hours, suggestive of a loss of DNA during the primary transfer. The majority of secondary transfer samples amplified less than 50% of STR loci regardless of contact type. This study showed that while DNA transfer is common between denim, this is not manifested in full STR profiles. We discuss the possible technical solutions to partial profiles from trace DNA, and more broadly the ubiquity of secondary DNA transfer.

9

Likelihood Ratios Given Activity-Level Propositions for DNA Transfer Evidence: Theoretical Foundations of the HaloGen Framework (Part I)

Gill, P.; Bleka, O.

2026-02-05 genetics 10.64898/2026.02.03.702484 medRxiv

Top 0.1%

66.9%

The interpretation of trace DNA evidence at activity level requires explicit modelling of transfer, persistence, and failure to detect a person of interest. We present the theoretical foundations of HaloGen, an open-source hierarchical Bayesian framework for evaluating biological results under competing activity-level propositions, such as direct versus secondary transfer. HaloGen accounts for dropout, multiple contributors, and multiple stains. Evidence is evaluated using an exhaustive-propositions likelihood ratio framework that combines information across contributors and stains, while fully accounting for uncertainty in transfer and detection. Observed DNA quantities and non-detects are handled consistently within a single probabilistic model, avoiding reliance on fixed parameter estimates. The framework yields intuitive and robust behaviour: strong support for direct transfer when DNA quantities are informative, and appropriately neutral or defence-leaning likelihood ratios in low-information or non-detect scenarios. An empirically constrained fail-rate parameter prevents spurious inflation of likelihood ratios when offender detection is unlikely, providing stability across laboratories and experimental conditions. This paper establishes the theoretical basis of HaloGen; a companion paper addresses validation and applied casework examples.

10

Archaeogenomic and Bioinformatic Analysis of the Columbus Lineage: Evidence from the Counts of Gelves.

Navarro Vera, I.; Bonilla, A.; Tirapu, M.; Albert, M.; Jimenez, P. P.; Herranz-Rodrigo, D.; Cruz-Alcazar, R.; Garcia, C.; Yravedra Sainz de los Terreros, J.

2026-04-04 genetics 10.64898/2026.04.01.715912 medRxiv

Top 0.1%

55.1%

The geographical and familial origins of Christopher Columbus have remained a subject of intense historiographical debate for over five centuries. Despite numerous hypotheses, empirical genetic evidence capable of resolving his ancestral history or place of birth has been absent from the literature until now. This study presents the third stage of the first forensic genetic analysis performed on skeletal remains belonging to several direct descendants of Columbus, spanning the 16th to 18th centuries. By applying Massively Parallel Sequencing (MPS) to analyse autosomal, X- and Y-chromosome DNA markers, and integrating the results with multidisciplinary evidence from the historical, genealogical, archaeological, and anthropological research involved in this project, the identification of several individuals found in the Crypt of Santa Maria de Gracia located in Gelves (Sevilla, Spain) has been achieved. The analysis of their biological relatedness enabled the reconstruction of kinship networks among the individuals interred in the crypt, which, when interpreted in the context of documented genealogical lineages, provides indirect but consistent evidence pointing toward the debated origin of the discoverer.

11

Whole-genome sequencing of a mid-20th-century femur from central Israel in an open missing-person case

Vol, E.; Waldman, S.; Lomes, A.; Brielle, E. S.; Appel, N.; Dolin, B.; Asif, S.; Nagar, Y.; Marco, E.; Bergman, N.; Khaner, O.; Raviv, D.; Oliel, J.; Lewis, R. Y.; Carmi, S.

2026-04-28 genetics 10.64898/2026.04.24.720291 medRxiv

Top 0.1%

47.2%

Genome-wide technologies can generate investigative leads in cold cases by determining the genetic ancestry of the forensic sample. Increasingly, DNA extraction and whole-genome sequencing or genotyping are being used to analyze early or middle-20th century skeletal remains. Here, we present the first case, to our knowledge, of whole-genome sequencing of a middle-20th-century bone sample from the Middle East. A femur discovered in a cave in Central Israel was proposed to belong to a person of Ashkenazi Jewish ancestry who was missing since 1948. Following DNA extraction and single-stranded library preparation, whole-genome sequencing generated nearly 500 million reads. However, only 0.5% of the reads mapped to the human genome, providing depth of coverage of 0.07x. After quality control and male sex inference, ancestry assignment was performed using principal components and ADMIXTURE analyses. The results suggested that the genome definitively belonged to a person of Arab ancestry, refuting the hypothesis of an Ashkenazi Jewish origin.
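The figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below is a back-of-the-envelope check, not the authors' calculation: the genome size of 3.1 Gb is an assumption, and duplicate removal and filtering, which the abstract does not describe, are ignored.

```python
HUMAN_GENOME_BP = 3.1e9  # assumed haploid genome size, not stated in the abstract

total_reads = 500e6       # "nearly 500 million reads"
mapped_fraction = 0.005   # "only 0.5% of the reads mapped"
depth = 0.07              # reported depth of coverage

mapped_reads = total_reads * mapped_fraction

# depth = mapped_reads * mean_length / genome_size, solved for mean_length
implied_mean_length = depth * HUMAN_GENOME_BP / mapped_reads

print(f"{mapped_reads:,.0f} mapped reads")
print(f"{implied_mean_length:.0f} bp implied mean mapped length")
```

Under these assumptions the reported values imply about 2.5 million mapped reads with a mean mapped length of roughly 87 bp, which is in the range one would expect for degraded skeletal DNA.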

12

Standardising a microbiome pipeline for body fluid identification from complex crime scene stains

Swayambhu, M.; Gysi, M.; Haas, C.; Schuh, L.; Walser, L.; Javanmard, F.; Flury, T.; Ahannach, S.; Lebeer, S.; Hanssen, E. N.; Snipen, L.; Bokulich, N.; Kuemmerli, R.; Arora, N.

2024-08-07 bioinformatics 10.1101/2024.08.05.604586 medRxiv

Top 0.1%

46.3%

Background: Recent advances in next-generation sequencing have opened up new possibilities for utilizing the human microbiome in various fields, including forensics. Researchers have capitalized on the site-specific microbial communities found in different parts of the body to identify body fluids from biological evidence. Despite promising results, microbiome-based methods have not yet been fully integrated into forensic practice due to the lack of standardized protocols and systematic testing of methods on forensically relevant samples. Our study addresses critical decisions in establishing these protocols, focusing on bioinformatics choices and the use of machine learning to present microbiome results in court for forensically relevant and challenging samples. Results: We propose using Operational Taxonomic Units (OTUs) for read data processing and creating heterogeneous training datasets for training a random forest classifier. Our classifier incorporates six forensically relevant classes: saliva, semen, hand skin, penile skin, urine, and vaginal/menstrual fluid. Across these classes, our classifier achieved a high weighted average F1 score of 0.89. Systematic testing on mixed-source samples and underwear revealed reliable detection of at least one component of the mixture and the identification of vaginal fluid from underwear substrates. Additionally, when investigating the sexually shared microbiome (sexome) of heterosexual couples, our classifier shows promising results for the inference of sexual activity. Conclusion: In our study, we recommend the use of a novel random forest classifier trained on a heterogeneous dataset for obtaining predictions from samples mimicking forensic evidence. We also highlight the potential of the sexome for assessing the nature of sexual activities in forensic investigations, while delineating areas that warrant further research. Furthermore, we underscore key considerations when presenting machine learning results for classifying mixed-source samples.
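For readers unfamiliar with the metric reported above, a weighted average F1 score is the per-class F1 averaged with weights proportional to each class's number of true samples. The sketch below shows the calculation under that standard definition; the function names and the counts are hypothetical and are not taken from the study.

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returned as 0 when the class has no counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(counts):
    """counts maps class name -> (tp, fp, fn); weights are class supports (tp + fn)."""
    total = sum(tp + fn for tp, fp, fn in counts.values())
    return sum((tp + fn) * f1(tp, fp, fn) for tp, fp, fn in counts.values()) / total

# Hypothetical counts for two of the six classes, for illustration only.
toy = {"saliva": (40, 10, 10), "semen": (18, 2, 2)}
print(round(weighted_f1(toy), 3))
```

Weighting by support means that well-represented classes dominate the summary, so a single weighted F1 can hide weak performance on a rare class.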

13

Open-Access STRS Database Of Populations From The 1000 Genomes Project Using High Coverage Phase 3 Data

Frontanilla, T. S.; Valle Silva, G.; Ayala, J.; Mendes, C. T.

2021-09-07 genetics 10.1101/2021.09.06.459168 medRxiv

Top 0.1%

45.4%

Accurate STR genotyping from next-generation sequencing (NGS) data has been challenging. Haplotype inference and phasing for STRs (HipSTR) was specifically developed to deal with genotyping errors and obtain reliable STR genotypes from whole-genome sequencing datasets. The objective of this investigation was to perform a comprehensive genotyping analysis of a set of STRs of broad forensic interest from the 1000 Genomes populations and release a reliable open-access STR database to the forensic genetics community. A set of 22 STR markers was analyzed using the CRAM files of the 1000 Genomes Project Phase 3 high-coverage (30x) dataset generated by the New York Genome Center (NYGC). HipSTR was used to call genotypes from 2,504 samples from 26 populations organized into five groups: African, East Asian, European, South Asian, and admixed American. The D21S11 marker could not be detected in the present study. Moreover, the Hardy-Weinberg equilibrium analysis, coupled with a comprehensive analysis of allele frequencies, revealed that HipSTR could not identify longer Penta E (and, to a lesser extent, Penta D) alleles. This issue is probably due to the limited length of sequencing reads available for genotype calling, resulting in heterozygote deficiency. Notwithstanding that, AMOVA, a clustering analysis using STRUCTURE, and a Principal Coordinates Analysis revealed a clear-cut separation between the four major ancestries sampled by the 1000 Genomes Consortium (AFR, EUR, EAS, SAS). Meanwhile, the AMOVA results corroborated previous reports that most of the variance (97.12%) is observed within populations. This set of analyses revealed that, except for larger Penta D and Penta E alleles, allele frequencies and genotypes defined by HipSTR from the 1000 Genomes Project Phase 3 data and offered as an open-access database are consistent and highly reliable.

14

Ancient genomes from the siege and destruction of Middle Bronze Age Roca Vecchia (Apulia, Italy) shed light on Aegean contacts and conflicts

Aneli, S.; Nicolini, V.; Vincenti, G.; Montinaro, F.; Sasso, S.; Saupe, T.; Kabral, H.; Solnik, A.; Tambets, K.; Guglielmino, R.; Fabbri, P. F.; Pagani, L.

2025-12-17 genetics 10.64898/2025.12.15.694319 medRxiv

Top 0.1%

42.4%

Background: Roca Vecchia, an iconic Bronze Age stronghold in Apulia, Southern Italy, was completely destroyed during a siege between the end of the 15th century BCE and the beginning of the 14th century BCE. During the siege, seven of the local people hid within the stronghold walls. Two others, who could have been either attackers or defenders, were found under the ruins of the main gate. The material culture found at Roca Vecchia and associated with the period of the siege includes Minoan-type pottery produced from local clay, imported Aegean pottery and an Aegean-type dagger, pointing to an established relationship between the site and the Minoan civilization. Therefore, the site offers an unprecedented opportunity to characterise the genetic components of the population inhabiting an indigenous settlement with increasing contacts with the Aegean world, and to shed light on the demic or cultural modes of the Minoan presence in the central Mediterranean. Results: We sampled six out of nine available unburied Middle Bronze Age individuals, contemporary with the siege and destruction of the site, and obtained genome-wide information for two individuals. When compared with available Minoan, Apulian and broadly Mediterranean genomes, the individuals showed a characteristic Bronze/Iron Age Italian peninsula genetic signature, with limited contribution from Minoans. Conclusions: We conclude that the local population of Roca Vecchia, at the moment of the siege, was predominantly autochthonous, with a minority Minoan component. A Minoan genetic signal is indeed likely present in one out of two analysed individuals who were certainly among the dwellers of Roca Vecchia. This confirms previous hypotheses supposing that a nucleus of "foreigners" coming from the Minoan world was living at the site and mixed with locals. Archaeological data suggest that the Roca Vecchia Aegean population component probably increased in the following centuries.

15

Comparative Evaluation of Targeted RNA Sequencing Protocols for Gene Expression Quantification With and Without Unique Molecular Indices (UMIs)

Gosch, A.; Courts, C.

2025-01-27 molecular biology 10.1101/2025.01.27.635010 medRxiv

Top 0.1%

41.0%

Interest in forensic RNA analysis has increased over recent years. RNA molecules present in forensic samples can be accurately quantified via quantitative PCR (qPCR); however, due to the limited number of markers that can be assayed simultaneously per reaction, qPCR is less suitable for applications requiring gene expression quantification of large marker sets. A few years ago, massively parallel targeted RNA sequencing (targRNAseq), which allows several hundred markers to be quantified simultaneously and accurately, was added to the forensic genetic tool set. However, typical targRNAseq protocols include a multiplex PCR step to amplify selected targets, which potentially introduces bias and limits accurate gene expression quantification. Unique Molecular Indices (UMIs) were invented to overcome this limitation and have been implemented in protocols from some vendors. In this study, we compared two targeted RNAseq protocols assaying expression of a set of 121 forensically relevant mRNA biomarkers: the Ion Ampliseq targeted RNA sequencing panel (Thermo Fisher Scientific), which employs a multiplex PCR without the use of UMIs, and the QIAseq targeted RNA panel (QIAGEN), which uses UMIs prior to multiplex amplification. Both protocols were tested on replicated samples and dilution series and compared with respect to sensitivity and accuracy of gene expression quantification. The UMI-based protocol exhibited decreased sensitivity in comparison to the non-UMI-based alternative; however, the use of UMI technology greatly improved gene expression quantification accuracy. We thus recommend the use of UMI-based protocols for targeted RNA sequencing in applications requiring accurate gene expression quantification.

16

StrPhaser constructs tandem repeat alleles from VCF data

Wang, X.; King, J.; Meng, H.; Coble, M. D.; Woerner, A. E.

2025-01-25 bioinformatics 10.1101/2025.01.22.634325 medRxiv

Top 0.1%

40.5%

Variant calling is a ubiquitous genomic technique that underpins many scientific disciplines. From a computational perspective, variant calling is a form of logical compression; neglecting large variation, a person's genome can be losslessly described as a set of differences (SNP and small InDel alleles) relative to the reference sequence. Another common genomic technique is haplotype phasing, wherein alleles are partitioned into their paternal and maternal components (as haplotypes). Some classes of alleles are more difficult to describe than others, e.g., short tandem repeats (STRs). STRs serve as a critical marker for many genetic assays. However, STRs tend not to be explicitly reported in most genomic workflows. Here, we present StrPhaser, a novel algorithm that leverages phased variant calling datasets in the VCF file format to construct STR alleles. We evaluated StrPhaser on ~10,000 STR alleles from 284 human genomes, achieving an average allele accuracy of 91%. In addition, StrPhaser better recovers longer STR alleles than competing approaches; in principle, STR alleles that are longer than the maximum read length can be characterized. This capability, combined with its user-friendly interface, speed, and generation of both STR genotypes and visualizations, makes StrPhaser a valuable tool for a wide range of genomic studies. Availability: StrPhaser is publicly available at https://github.com/XuewenWangUGA/StrPhaser.

17

Revisiting the STRmix likelihood ratio probability interval coverage considering multiple factors

Bright, J.-A.; Lee, S.-I.; Buckleton, J.; Taylor, D. A.

2021-06-25 molecular biology 10.1101/2021.06.25.449960 medRxiv

Top 0.1%

40.5%

In previously reported work, a method for applying a lower bound to the variation induced by the Monte Carlo effect was trialled. This is implemented in the widely used probabilistic genotyping system, STRmix. The approach did not give the desired 99% coverage. However, the method for assigning the lower bound to the MCMC variability is only one of a number of layers of conservatism applied in a typical application. We tested all but one of these sources of variability collectively and term the result the near global coverage. The near global coverage for all tested samples was greater than 99.5% for inclusionary average LRs of known donors. This suggests that, when included in the probability interval method, the other layers of conservatism are more than adequate to compensate for the intermittent underperformance of the MCMC variability component. Running for extended MCMC accepts was also shown to result in improved precision.

18

Rare instances of non-random dropout with the monochrome multiplex qPCR assay for mitochondrial DNA copy number

Yang, S. Y.; Newcomb, C. E.; Battle, S. L.; Hsieh, A. Y.; Chapman, H. L.; Cote, H. C.; Arking, D. E.

2021-10-11 genetics 10.1101/2021.10.11.463983 medRxiv

Top 0.1%

40.2%

Mitochondrial DNA copy number (mtDNA-CN) is a proxy for mitochondrial function and has been of increasing interest to the mitochondrial research community. There are a number of ways to measure mtDNA-CN, ranging from qPCR to whole genome sequencing [1]. A recent article in the Journal of Molecular Diagnostics [2] described a novel method for measuring mtDNA-CN that is both inexpensive and reproducible. After adapting the assay for use in our lab, we have found it to be reproducible and well-correlated with mtDNA-CN derived from whole genome sequencing. However, certain individuals show poor concordance between the two measures, particularly individuals with qPCR mtDNA-CN measurements >3 standard deviations below the sample mean, which corresponds to roughly 1% of assayed individuals (Figure 1). After examining whole genome sequencing data, this seems to be due to specific polymorphisms within the D-loop primer region, at positions MT 338, 340, 452, 457, 458, 460, 461, 466, and 467. All individuals with a variant in at least one of these positions have non-concordant mtDNA-CN measurements. Meanwhile, variants observed at other positions within the primer region do not appear to cause dropout.

Figure 1. Discrepancy between the monochrome multiplex qPCR mtDNA-CN and the whole genome sequencing mtDNA-CN for 1,732 distinct individuals. Data are centered at 0 and scaled so that the standard deviation = 1. The dotted red line represents 3 standard deviations beneath the sample mean. Individuals in the U, L1, L4, and T haplogroups have a disproportionately higher risk of discordant measures between the two assays.

19

Testing methods for quantifying Monte Carlo variation for categorical variables in Probabilistic Genotyping

Bright, J.-A.; Taylor, D. A.; Curran, J. M.; Buckleton, J.

2021-06-26 molecular biology 10.1101/2021.06.25.450000 medRxiv

Top 0.1%

39.9%

Two methods for applying a lower bound to the variation induced by the Monte Carlo effect are trialled. One of these is implemented in the widely used probabilistic genotyping system, STRmix. Neither approach gives the desired 99% coverage. In some cases the coverage is much lower than the desired 99%. The discrepancy (i.e. the distance between the LR corresponding to the desired coverage and the LR observed coverage at 99%) is not large. For example, the discrepancy of 0.23 for approach 1 suggests the lower bounds should be moved downwards by a factor of 1.7 to achieve the desired 99% coverage. Although less effective than desired, these methods provide a layer of conservatism that is additional to the other layers. These other layers are from factors such as the conservatism within the sub-population model, the choice of conservative measures of co-ancestry, the consideration of relatives within the population and the resampling method used for allele probabilities, all of which tend to understate the strength of the findings.

Highlights:
- Two methods for quantifying Monte Carlo variability are tested.
- Both give less than the desired 99% coverage.
- The magnitude of possible discrepancy is small.
- For example, an LR of 4.3 × 10^11 could be reported as 1.8 × 10^12.
- An LR of 18 could be reported as 22.
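The step from a discrepancy of 0.23 to a factor of 1.7 in the abstract is consistent with the discrepancy being measured on a log10(LR) scale. That scale is an assumption here, since the abstract does not state it, and the helper name `shifted_lower_bound` is hypothetical. A short sketch of the conversion:

```python
discrepancy_log10 = 0.23          # from the abstract; the log10 scale is an assumption
factor = 10 ** discrepancy_log10  # multiplicative shift on the LR scale

def shifted_lower_bound(lr_bound, discrepancy=discrepancy_log10):
    """Move a reported lower-bound LR downwards by the given log10 discrepancy."""
    return lr_bound / 10 ** discrepancy

print(f"factor = {factor:.2f}")
print(f"{shifted_lower_bound(1e6):.3g}")  # hypothetical lower bound of 1e6
```

On this reading, dividing a reported lower bound by about 1.7 would restore the nominal 99% coverage for approach 1.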

20

Mispitools: An R Package for Comprehensive Statistical Methods in Kinship Inference

Marsico, F.

2024-08-19 genetics 10.1101/2024.08.16.608307 medRxiv

Top 0.1%

39.6%

The search for missing persons is a complex process that involves the comparison of data from two entities: unidentified persons (UP), who may be alive or deceased, and missing persons (MP), whose whereabouts are unknown. Although existing tools support DNA-based kinship analyses for the search, they typically do not integrate or statistically evaluate diverse lines of evidence collected throughout the investigative process. Examples of alternative lines of evidence are pigmentation traits, biological sex, and age, among others. The package Mispitools fills this gap by providing comprehensive statistical methods adapted to a holistic investigation workflow. Mispitools systematically assesses the data from each investigative stage, computing the statistical weight of various types of evidence through a likelihood ratio (LR) approach. It also provides models for combining obtained LRs. Furthermore, Mispitools offers customized visualizations and a user-friendly interface, broadening its applicability among forensic practitioners and genealogical researchers.
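One common way to combine LRs from several lines of evidence is to multiply them, which is valid only when the evidence types are conditionally independent given each hypothesis. The sketch below illustrates that generic rule together with the odds form of Bayes' theorem. It is written in Python for illustration, it is not the Mispitools API (an R package), and the function names, LR values, and prior are hypothetical.

```python
from math import prod

def combined_lr(lrs):
    """Product of per-evidence LRs; assumes the evidence types are
    conditionally independent given each hypothesis."""
    return prod(lrs)

def posterior_probability(lr, prior):
    """Bayes' rule in odds form: posterior odds = prior odds * LR."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * lr
    return post_odds / (1 + post_odds)

# Hypothetical LRs for DNA, biological sex, and age; not taken from the paper.
lrs = {"dna": 500.0, "sex": 2.0, "age": 4.0}
total = combined_lr(lrs.values())
print(total, round(posterior_probability(total, prior=0.001), 3))
```

The example shows why non-genetic evidence can matter: a modest LR from sex or age multiplies into the DNA-based LR and can move a borderline posterior probability substantially.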